API: consistent NaN treatment for pyarrow dtypes #61732

jbrockmendel · 2025-06-28T17:23:26Z

This is the third of several POCs stemming from the discussion in #61618 (see #61708, #61716). The main goal is to see how invasive it would be.

Specifically, this changes the behavior of pyarrow floating dtypes to treat NaN as distinct from NA in the constructors and __setitem__ (xref #32265). The change also affects to_numpy and .values.
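To make the intended semantics concrete, here is a minimal pure-Python sketch. It is a toy model, not the ArrowEA implementation: `ToyFloatArray`, `_is_missing`, and the `nan_is_na` argument are hypothetical names used only for illustration. It stores raw values next to a validity mask and shows how the constructor and `__setitem__` differ depending on whether NaN is folded into NA.

```python
import math
from dataclasses import dataclass, field


def _is_missing(item, nan_is_na: bool) -> bool:
    """Decide whether an incoming scalar should be stored as NA."""
    if item is None:
        return True
    return nan_is_na and isinstance(item, float) and math.isnan(item)


@dataclass
class ToyFloatArray:
    """Hypothetical nullable float array: raw values plus a validity mask."""

    nan_is_na: bool
    values: list = field(default_factory=list)
    valid: list = field(default_factory=list)  # False marks NA

    @classmethod
    def from_sequence(cls, data, nan_is_na: bool) -> "ToyFloatArray":
        arr = cls(nan_is_na=nan_is_na)
        for item in data:
            missing = _is_missing(item, nan_is_na)
            arr.values.append(0.0 if missing else float(item))
            arr.valid.append(not missing)
        return arr

    def __setitem__(self, index: int, item) -> None:
        missing = _is_missing(item, self.nan_is_na)
        self.values[index] = 0.0 if missing else float(item)
        self.valid[index] = not missing

    def isna(self) -> list:
        """True where the slot is NA (masked out)."""
        return [not ok for ok in self.valid]

    def isnan(self) -> list:
        """True where the slot holds a real float NaN."""
        return [ok and math.isnan(v) for v, ok in zip(self.values, self.valid)]


distinct = ToyFloatArray.from_sequence([1.0, float("nan"), None], nan_is_na=False)
distinct[0] = float("nan")  # stored as a float value, not as NA
```

With `nan_is_na=False`, only the `None` entry ends up masked; with `nan_is_na=True`, the NaN entries are masked as well, which is the behavior the PR moves away from for pyarrow floating dtypes.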

Notes:

- ~~This makes the decision to treat NaNs as close-enough to NA when a user explicitly asks for a pyarrow integer dtype. I think this is the right API, but won't check the box until there's a consensus.~~ Changed this following Matt's opinion.
- I still have ~~113~~ ~~89~~ ~~9~~ 0 failing tests locally. ~~Most of these are in json, sql, or test_EA_types (which is about csv round-tripping).~~
- Finding the mask to pass to pa.array needs optimization.
- ~~The kludge in NDFrame.where is ugly and fragile.~~ Fixed.
- Need to double-check the new expected in the rank test. Maybe re-write the test with NA instead of NaN?
- Do we change to_numpy() behavior to not convert NAs to NaNs? This would be needed to make the test_setitem_frame_2d_values tests pass.
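The to_numpy() question can be illustrated with a small round trip. This is a pure-Python sketch under stated assumptions: `export_na_as_nan` and `import_distinguishing_nan` are hypothetical helpers, `NA = None` stands in for the real NA scalar, and the example shows the general information loss rather than reproducing test_setitem_frame_2d_values.

```python
import math

NA = None  # stand-in for the NA scalar in this sketch


def export_na_as_nan(values, valid):
    """Replace NA slots with NaN, as a float export with na_value=nan would."""
    return [v if ok else math.nan for v, ok in zip(values, valid)]


def import_distinguishing_nan(data):
    """Rebuild (values, valid) treating NaN as an ordinary float; only NA is missing."""
    valid = [item is not NA for item in data]
    values = [item if ok else 0.0 for item, ok in zip(data, valid)]
    return values, valid


# Toy array holding [1.0, NA, NaN]: the second slot is masked, the third is a real NaN.
values, valid = [1.0, 0.0, math.nan], [True, False, True]

exported = export_na_as_nan(values, valid)
_, valid_after = import_distinguishing_nan(exported)
```

After the export, both the NA slot and the NaN slot hold NaN, so re-importing with NaN treated as a value yields no missing entries at all. Once NaN and NA are distinct, converting NA to NaN on export is therefore lossy.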

jbrockmendel · 2025-06-30T15:19:46Z

@mroeschke when convenient I'd like to get your thoughts before getting this working. It looks pretty feasible.

pandas/core/arrays/arrow/array.py

mroeschke · 2025-06-30T17:03:05Z

Generally +1 in this direction. Glad to see the changes to make this work are fairly minimal.

rhshadrach

+1 as well; this is nice.

Dr-Irv · 2025-07-21T19:28:17Z

Not able to judge the implementation, but I'm +1 on the concept.

jorisvandenbossche · 2025-07-22T23:08:27Z

pandas/core/arrays/masked.py

+        dtype, na_value = to_numpy_dtype_inference(
+            self, dtype, na_value, hasna, is_pyarrow=False


Does this change mean we now use object dtype instead of converting NA to NaN?

We initially did that for the masked arrays' conversion to numpy, but then changed it to use NaNs, because constantly getting object dtype was too annoying (there is an issue discussing this, IIRC).
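The trade-off described here can be sketched in plain Python. This is not the real `to_numpy_dtype_inference`: `export_ints`, the `na_to_nan` switch, and the `NA` sentinel are hypothetical, and the dtype names are returned as strings purely for illustration.

```python
import math

NA = "<NA>"  # hypothetical sentinel standing in for the NA scalar


def export_ints(values, valid, na_to_nan: bool):
    """Return (dtype_name, data) for a toy nullable-integer array."""
    if all(valid):
        # Nothing is missing, so the integers can be exported unchanged.
        return "int64", list(values)
    if na_to_nan:
        # Missing slots become NaN, which forces a cast of everything to float.
        return "float64", [float(v) if ok else math.nan for v, ok in zip(values, valid)]
    # Keep exact integers and an explicit NA marker; only object dtype can hold both.
    return "object", [v if ok else NA for v, ok in zip(values, valid)]
```

Converting NA to NaN gives a numeric result at the cost of a float cast, while keeping NA explicit means falling back to object dtype whenever anything is missing, which is the annoyance mentioned above.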

jorisvandenbossche · 2025-07-22T23:20:35Z

While I am personally in favor of distinguishing NaN and NA, I think most of the changes here involve distinguishing NaN when constructing the arrays? (So e.g. constructing the pyarrow-based EA from user input like numpy arrays?)

Personally, I think that is a change we should only make after making those dtypes the default, and probably even years after that, following a very long deprecation process.
(Currently everyone who creates pandas DataFrames from numpy data assumes that the NaNs in the numpy data are considered missing. IMO that is a behaviour that we will have to keep (for a long time) even if we distinguish NaN and NA.)

jbrockmendel · 2025-07-23T15:50:16Z

> I think most of the changes here involve distinguishing NaN when constructing the arrays?

Yes. Constructors (which affect read_csv) and __setitem__ are most of this.

> I think that is a change we should only make after making those dtypes the default, and probably even years after that after a very long deprecation process.

My current thought (will bring up on today's dev call) is that we should add a global flag that selects between the two behaviors: never-distinguish (see #61708) as the default, and always-distinguish (this PR) as opt-in.

jbrockmendel · 2025-07-31T16:40:11Z

Based on last week's dev call, I am adapting this and #61708 from POCs to real PRs. This implements a global flag "mode.nan_is_na" (default True) to choose which behavior we want.

This PR only implements this for ArrowEA. #61708 will do the same for the numpy-nullables. (I have a branch trying to do it all at once and it is getting ungainly). A third PR will add tests for the various issues this closes.
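As a rough illustration of how such a flag could gate behavior, here is a self-contained toy. It is not pandas' option machinery: the `_options` registry, `get_option`, `option_context`, and `count_missing` below are hypothetical stand-ins written for this sketch, and only the flag name "mode.nan_is_na" and its default of True come from the comment above.

```python
from contextlib import contextmanager

# Toy option registry; the default keeps NaN and NA interchangeable.
_options = {"mode.nan_is_na": True}


def get_option(name: str) -> bool:
    return _options[name]


@contextmanager
def option_context(name: str, value: bool):
    """Temporarily override an option, restoring the old value on exit."""
    old = _options[name]
    _options[name] = value
    try:
        yield
    finally:
        _options[name] = old


def count_missing(data) -> int:
    """Count entries treated as missing under the current flag."""
    nan_is_na = get_option("mode.nan_is_na")
    return sum(
        item is None or (nan_is_na and item != item)  # NaN is the only value != itself
        for item in data
    )


data = [1.0, float("nan"), None]
default_count = count_missing(data)
with option_context("mode.nan_is_na", False):
    opt_in_count = count_missing(data)
```

Under the default, both the NaN and the `None` entry count as missing; inside the opt-in context only `None` does, and the default is restored when the block exits.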

mroeschke reviewed Jun 30, 2025

pandas/core/arrays/arrow/array.py (outdated)

mroeschke reviewed Jun 30, 2025

pandas/core/arrays/arrow/array.py (outdated)

This was referenced Jul 1, 2025

Moving to PyArrow dtypes by default #61618

Open

BUG: Decimal(NaN) incorrectly allowed in ArrowEA constructor with tim… #61773

Merged

Request For Help: unexplained ArrowInvalid overflow #61776

Closed

jbrockmendel force-pushed the poc-arrow-nans branch from f1e8ba0 to ca6e8e8 on July 5, 2025 16:41

jbrockmendel mentioned this pull request Jul 5, 2025

PERF: avoid object-dtype path in ArrowEA._explode #61786

Merged

jbrockmendel force-pushed the poc-arrow-nans branch 2 times, most recently from f547fab to 02d12a3 on July 7, 2025 20:00

rhshadrach reviewed Jul 7, 2025

jbrockmendel added 13 commits July 7, 2025 14:57

BUG: Decimal(NaN) incorrectly allowed in ArrowEA constructor with tim…

31e65e0

…estamp type

GH ref

9dcd8fb

BUG: ArrowEA constructor with timestamp type

3fb47c7

POC: consistent NaN treatment for pyarrow dtypes

c18ab05

comment

74a2248

Down to 40 failing tests

9d8fef4

Fix rank, json tests

f47c746

CLN: remove outdated

083f705

Fix where kludge

a340203

update tests

587e53f

Fix remaining tests

734465c

mypy fixup

d2aeeff

old-numpy compat

73a95d2

jbrockmendel force-pushed the poc-arrow-nans branch from 02d12a3 to 73a95d2 on July 7, 2025 21:59

simplify

ce28027

simonjayhawkins added the "PDEP missing values" label (Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint) Jul 22, 2025

jorisvandenbossche reviewed Jul 22, 2025

jbrockmendel added 3 commits July 30, 2025 13:37

Merge branch 'main' into poc-arrow-nans

9300ad0

Merge branch 'main' into poc-arrow-nans

76bc3d2

Better option name, fixture

0327507

jbrockmendel marked this pull request as ready for review July 31, 2025 16:36

default True

c0bdd67

jbrockmendel added 5 commits July 31, 2025 11:15

Patch ops

2467f6e

mypy fixup

6356cc0

Test for setitem/construction

09e5bf5

update ufunc test

ce36571

Improve rank test skips

5bc2617

jbrockmendel changed the title ~~POC: consistent NaN treatment for pyarrow dtypes~~ API: consistent NaN treatment for pyarrow dtypes Jul 31, 2025
